Convert real(kind_phys) vegetation, slope and soil type arrays into integer arrays without affecting input/output files, update submodule pointer for CMakeModules (updates to FindNetCDF.cmake and FindESMF.cmake), switch to EMC hpc-stack on gaea, update utest/opnReqTest #804

Merged

Conversation

climbfuji
Collaborator

@climbfuji climbfuji commented Sep 15, 2021

PR Checklist

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.

  • This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR

  • An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in the ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes contained in the PR. The Issue and the dependent sub-component PR are specified below.

  • If new or updated input data is required by this PR, it is clearly stated in the text of the PR.

Description

This PR only updates the submodule pointers for fv3atm and ccpp-physics for the changes described in the dependent PRs below (convert real(kind_phys) vegetation, slope and soil type arrays into integer arrays without affecting input/output files).

Also included:

  • Update submodule pointer for CMakeModules (updates to FindNetCDF.cmake and FindESMF.cmake)
  • Gaea only: switch to hpc-stack maintained by EMC (@kgerheiser)
  • Update to utest (now opnReqTest) CI testing from @MinsukJi-NOAA

No changes to the input files.
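
To illustrate what "integer arrays without affecting input/output files" can mean in practice, here is a minimal sketch in Python. It is not the Fortran code from fv3atm or ccpp-physics; the function names and the sample values are hypothetical, and the choice to round and reject non-whole values is an assumption, since the PR text does not say how the conversion is done.

```python
from typing import List


def to_integer_types(real_values: List[float]) -> List[int]:
    """Convert category codes stored as reals (e.g. 7.0) to integers.

    Hypothetical helper: assumes the codes are whole numbers stored as reals.
    """
    ints = []
    for value in real_values:
        nearest = int(round(value))
        # Category codes should already be whole numbers; refuse anything else.
        if abs(value - nearest) > 1.0e-6:
            raise ValueError(f"non-integer category code: {value!r}")
        ints.append(nearest)
    return ints


def to_real_for_output(int_values: List[int]) -> List[float]:
    """Convert back to reals so the on-disk representation stays the same."""
    return [float(v) for v in int_values]


# Made-up sample values standing in for a vegetation type field read from file.
vegetation_type_from_file = [7.0, 10.0, 0.0, 16.0]
vegetation_type = to_integer_types(vegetation_type_from_file)  # used internally
written_back = to_real_for_output(vegetation_type)             # written to file
```

Under these assumptions the round trip is lossless, which is why the files on disk do not need to change even though the internal arrays are integers.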

Update 2021/09/30: The regression test fv3_gsd now produces different results with Intel on Hera. The difference first appears at the 24h forecast time as a minimal difference in one of the tiles and then propagates to all tiles. The regression test fv3_gsd_debug (running out to 6h only) is b4b identical. With GNU, both the 48h forecast fv3_gsd and the 6h forecast fv3_gsd_debug are b4b identical. I then ran the DEBUG version out to 48h, and the results of the original code and this PR were identical. Thus, the b4b difference for fv3_gsd with Intel in PROD mode is due to an optimization round-off difference in RUC LSM.
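
As a generic illustration of how an optimization can break bit-for-bit (b4b) reproducibility, the sketch below shows that regrouping a floating-point sum changes the result at the bit level. This is a Python example of the general effect, not a reproduction of what the Intel compiler does in RUC LSM; the helper name `bits` is hypothetical.

```python
import struct


def bits(x: float) -> str:
    """Return the IEEE 754 double-precision bit pattern of x as hex."""
    return struct.pack(">d", x).hex()


# The same three numbers, summed in two different groupings. An optimizing
# compiler may legally choose either order unless told otherwise.
left_to_right = (0.1 + 0.2) + 0.3
regrouped = 0.1 + (0.2 + 0.3)

# A b4b comparison checks the bit patterns, not approximate equality.
b4b_identical = bits(left_to_right) == bits(regrouped)
print(left_to_right, regrouped, b4b_identical)
```

The two sums differ only in the last bit, but in a long forecast such a difference can grow and spread, which is consistent with the behavior described above.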

Update 2021/10/01 Switching to hpc-stack maintained by EMC (@kgerheiser) on gaea changes the "results" for the following regression tests - on gaea only, of course, and only in the Grib2 files produced by the regional inline post:

Dom.Heinzeller@gaea12:/lustre/f2/scratch/Dom.Heinzeller/ufs-weather-model/ufs-weather-model-remove-noah-wrfv4-bugfix-mp-thompson/tests [intel|esrl_bmcs]> cat log_gaea.intel/rt_038_regional_quilt_2threads.log

baseline dir = /lustre/f2/pdata/ncep_shared/emc.nemspara/RT/NEMSfv3gfs/develop-20210928/INTEL/fv3_regional_quilt
working dir  = /lustre/f2/scratch/Dom.Heinzeller/FV3_RT/rt_38214/regional_quilt_2threads
Checking test 038 regional_quilt_2threads results ....
 Comparing dynf000.nc .........OK
 Comparing dynf024.nc .........OK
 Comparing phyf000.nc .........OK
 Comparing phyf024.nc .........OK
 Comparing PRSLEV.GrbF00 .........NOT OK
 Comparing PRSLEV.GrbF24 .........NOT OK
 Comparing NATLEV.GrbF00 .........NOT OK
 Comparing NATLEV.GrbF24 .........NOT OK

 0: The total amount of wall time                        = 267.801442

Test 038 regional_quilt_2threads FAIL

Dom.Heinzeller@gaea12:/lustre/f2/scratch/Dom.Heinzeller/ufs-weather-model/ufs-weather-model-remove-noah-wrfv4-bugfix-mp-thompson/tests [intel|esrl_bmcs]> cat log_gaea.intel/rt_041_regional_quilt_RRTMGP.log

baseline dir = /lustre/f2/pdata/ncep_shared/emc.nemspara/RT/NEMSfv3gfs/develop-20210928/INTEL/fv3_regional_quilt_RRTMGP
working dir  = /lustre/f2/scratch/Dom.Heinzeller/FV3_RT/rt_38214/regional_quilt_RRTMGP
Checking test 041 regional_quilt_RRTMGP results ....
 Comparing dynf000.nc .........OK
 Comparing dynf024.nc .........OK
 Comparing phyf000.nc .........OK
 Comparing phyf024.nc .........OK
 Comparing PRSLEV.GrbF00 .........NOT OK
 Comparing PRSLEV.GrbF24 .........NOT OK
 Comparing NATLEV.GrbF00 .........NOT OK
 Comparing NATLEV.GrbF24 .........NOT OK

 0: The total amount of wall time                        = 469.675394

Test 041 regional_quilt_RRTMGP FAIL

Dom.Heinzeller@gaea12:/lustre/f2/scratch/Dom.Heinzeller/ufs-weather-model/ufs-weather-model-remove-noah-wrfv4-bugfix-mp-thompson/tests [intel|esrl_bmcs]> cat log_gaea.intel/rt_037_regional_quilt.log

baseline dir = /lustre/f2/pdata/ncep_shared/emc.nemspara/RT/NEMSfv3gfs/develop-20210928/INTEL/fv3_regional_quilt
working dir  = /lustre/f2/scratch/Dom.Heinzeller/FV3_RT/rt_38214/regional_quilt
Checking test 037 regional_quilt results ....
 Comparing dynf000.nc .........OK
 Comparing dynf024.nc .........OK
 Comparing phyf000.nc .........OK
 Comparing phyf024.nc .........OK
 Comparing PRSLEV.GrbF00 .........NOT OK
 Comparing PRSLEV.GrbF24 .........NOT OK
 Comparing NATLEV.GrbF00 .........NOT OK
 Comparing NATLEV.GrbF24 .........NOT OK

 0: The total amount of wall time                        = 858.850208

Test 037 regional_quilt FAIL
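
The three logs above share one pattern: the NetCDF files compare OK and only the Grib2 files do not. A minimal sketch for extracting that from a log, assuming the "Comparing <file> .........OK / NOT OK" line format shown above (the function name is hypothetical and this is not part of the regression test scripts):

```python
import re
from typing import Dict

# Matches lines such as " Comparing dynf000.nc .........OK".
COMPARE_LINE = re.compile(r"^\s*Comparing\s+(\S+)\s+\.+(NOT OK|OK)\s*$")


def parse_comparisons(log_text: str) -> Dict[str, bool]:
    """Map each compared file name to True (OK) or False (NOT OK)."""
    results = {}
    for line in log_text.splitlines():
        match = COMPARE_LINE.match(line)
        if match:
            results[match.group(1)] = match.group(2) == "OK"
    return results


# Excerpt copied from the rt_037_regional_quilt log above.
sample_log = """\
Checking test 037 regional_quilt results ....
 Comparing dynf000.nc .........OK
 Comparing dynf024.nc .........OK
 Comparing phyf000.nc .........OK
 Comparing phyf024.nc .........OK
 Comparing PRSLEV.GrbF00 .........NOT OK
 Comparing PRSLEV.GrbF24 .........NOT OK
 Comparing NATLEV.GrbF00 .........NOT OK
 Comparing NATLEV.GrbF24 .........NOT OK
"""

results = parse_comparisons(sample_log)
failed = sorted(name for name, ok in results.items() if not ok)
print(failed)
```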

@WenMeng-NOAA helped me to look at the differences, and they are only in the grib2 header section. The data itself is 100% identical. Please look at the discussion in NOAA-EMC/hpc-stack#337 (towards the bottom) for more details.
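
One way to check a claim of this kind is to split each GRIB2 message into its numbered sections and compare them one by one. The sketch below relies only on the GRIB edition 2 framing (a 16-byte Section 0 starting with "GRIB", then sections that each begin with a 4-byte length and a 1-byte section number, ending with "7777"). The function names are hypothetical, the test messages are synthetic bytes rather than real model output, and the PR does not say which section the actual differences were in.

```python
import struct
from typing import List, Tuple


def split_sections(message: bytes) -> List[Tuple[int, bytes]]:
    """Split one GRIB2 message into (section number, raw bytes) pairs."""
    if message[:4] != b"GRIB" or message[7] != 2:
        raise ValueError("not a GRIB edition 2 message")
    sections = [(0, message[:16])]  # Section 0 (indicator) is always 16 bytes
    pos = 16
    while message[pos:pos + 4] != b"7777":
        (length,) = struct.unpack(">I", message[pos:pos + 4])
        if length < 5 or pos + length > len(message):
            raise ValueError("corrupt section length")
        sections.append((message[pos + 4], message[pos:pos + length]))
        pos += length
    sections.append((8, message[pos:pos + 4]))  # Section 8 (end marker)
    return sections


def differing_sections(a: bytes, b: bytes) -> List[int]:
    """Return the numbers of the sections whose bytes differ."""
    sa, sb = split_sections(a), split_sections(b)
    if [n for n, _ in sa] != [n for n, _ in sb]:
        raise ValueError("messages have different section layouts")
    return [n for (n, x), (_, y) in zip(sa, sb) if x != y]


def build_message(sections: List[Tuple[int, bytes]]) -> bytes:
    """Build a synthetic message for the demo (payloads are not valid GRIB)."""
    body = b"".join(
        struct.pack(">IB", len(payload) + 5, number) + payload
        for number, payload in sections
    )
    total = 16 + len(body) + 4
    return b"GRIB\x00\x00" + bytes([0, 2]) + struct.pack(">Q", total) + body + b"7777"


# Two synthetic messages: identical data section (7), different section 1.
old = build_message([(1, b"library-A"), (7, b"\x01\x02\x03")])
new = build_message([(1, b"library-B"), (7, b"\x01\x02\x03")])
print(differing_sections(old, new))
```

Real Grib2 files usually hold many messages back to back, so a practical check would loop over messages or use a dedicated GRIB tool; the sketch only shows the idea of separating metadata sections from the data section.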

Issue(s) addressed

Testing

Initial regression testing

Full regression tests passed on hera.intel against existing baseline.

Final regression testing

Regression tests will be run on all tier-1 platforms:

Dependencies

DeniseWorthen and others added 30 commits March 27, 2021 12:30
This reverts commit 7b826d4.
@climbfuji
Collaborator Author

Automated RT Failure Notification
Machine: hera
Compiler: intel
Job: RT
Repo location: /scratch1/NCEPDEV/nems/emc.nemspara/autort/pr/734707651/20211003031514/ufs-weather-model
Please manually delete: /scratch1/NCEPDEV/stmp2/emc.nemspara/FV3_RT/rt_16988
Test datm_cdeps_mx025_gefs 086 failed failed
Test datm_cdeps_mx025_gefs 086 failed in run_test failed
Please make changes and add the following label back:
hera-intel-RT

Something is wrong recently with hera ...

29 min. TEST 086 datm_cdeps_mx025_gefs is running,  status: R jobid 23821372
30 min. TEST 086 datm_cdeps_mx025_gefs is running,  status: R jobid 23821372
Slurm unknown status CG. Check sacct ...
23821372                  TIMEOUT         rt_16988_086
23821372.ba+            CANCELLED                batch
23821372.ex+            COMPLETED               extern
23821372.0                TIMEOUT              fv3.exe
31 min. TEST 086 datm_cdeps_mx025_gefs is TIMEOUT,  status: CG jobid 23821372

@climbfuji climbfuji changed the title from "Convert real(kind_phys) vegetation, slope and soil type arrays into integer arrays without affecting input/output files, update submodule pointer for CMakeModules (updates to FindNetCDF.cmake and FindESMF.cmake), switch to EMC hpc-stack on gaea" to "Convert real(kind_phys) vegetation, slope and soil type arrays into integer arrays without affecting input/output files, update submodule pointer for CMakeModules (updates to FindNetCDF.cmake and FindESMF.cmake), switch to EMC hpc-stack on gaea, update utest/opnReqTest" on Oct 4, 2021
@github-actions github-actions bot removed the run-ci label Oct 4, 2021
@climbfuji
Collaborator Author

@DusanJovic-NOAA @junwang-noaa @MinsukJi-NOAA I merged in the changes from Minsuk (climbfuji#11) and added the run-ci and hera-intel-rt labels. The submodule pointer for fv3atm is also updated and correct. Please check the PR again while the regression tests / CI tests are running. Thanks!

@climbfuji
Collaborator Author

CI tests passed, waiting for final hera-intel RT runs. If those come back successfully, whoever gets to it first please merge.

@BrianCurtis-NOAA
Collaborator

Automated RT Failure Notification
Machine: hera
Compiler: intel
Job: RT
Repo location: /scratch1/NCEPDEV/nems/emc.nemspara/autort/pr/734707651/20211004134511/ufs-weather-model
Please manually delete: /scratch1/NCEPDEV/stmp2/emc.nemspara/FV3_RT/rt_2258
Test cpld_restart_c384_p7 009 failed failed
Test cpld_restart_c384_p7 009 failed in run_test failed
Please make changes and add the following label back:
hera-intel-RT

@MinsukJi-NOAA
Contributor

MinsukJi-NOAA commented Oct 4, 2021

cpld_restart_c384_p7 run never got started and eventually timed out. Will manually run this case.

@MinsukJi-NOAA
Contributor

Hera Intel RT passed, and this PR is ready for merge.

@BrianCurtis-NOAA
Collaborator

Automated RT Failure Notification
Machine: gaea
Compiler: intel
Job: BL
Repo location: /lustre/f2/pdata/ncep/emc.nemspara/autort/pr/734707651/20211001181505/ufs-weather-model
Please make changes and add the following label back:
gaea-intel-BL

Labels
Baseline Updates: Current baselines will be updated.
Waiting for Reviews: The PR is waiting for reviews from associated component PR's.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

  • Remove real land/sea/ice mask, use integer variable instead
  • Transfer Gaea responsibilities from Dom to EMC
6 participants